Supervised Learning used to predict dry bean outcomes

Table of contents:

  1. Introduction

  2. Required Libraries

  3. Functions used throughout this notebook

  4. Data Analysis

  5. Pre-Processing of Data

  6. Classification Techniques, Evaluation & Prediction

    1. Decision Tree
    2. Nearest Neighbor
    3. Naïve Bayes
    4. Support Vector Machines
    5. Neural Networks
  7. Conclusion

Introduction

This project uses Supervised Learning to predict the type of a dry bean.

We use a dataset containing around 13,600 samples. These samples have the following attributes:

There are 7 possible bean types: Seker, Barbunya, Bombay, Cali, Horoz, Sira, Dermason.

Using the different values of these attributes, we will train several classifiers to predict which bean type is most likely to be the correct one, aiming for a model with a decent prediction score.

Required Libraries

Functions used throughout this notebook

Data Analysis

Pre-Processing of Data

Finding null or not available values
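Since the notebook's code cells are not shown here, the check can be sketched with a small pandas snippet; the frame and column names below are illustrative stand-ins for the bean DataFrame:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dry-bean DataFrame; column names are illustrative.
df = pd.DataFrame({
    "Area": [28395.0, np.nan, 29380.0],
    "Class": ["SEKER", "SEKER", "BARBUNYA"],
})

# Count missing (null / not-available) values per column.
null_counts = df.isnull().sum()
print(null_counts)

# Drop any rows that contain missing values.
df = df.dropna()
print(len(df), "rows remain")
```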

Histograms for each type of dry bean

SEKER


By analyzing the histograms above, we can see that in the Solidity histogram of the Seker bean most beans have a value of 0.98 or greater. We therefore treated lower values as outliers and removed them; only 14 of the 2027 Seker beans fit this description.

Likewise, in the ShapeFactor4 histogram most Seker beans have a value of 0.996 or greater, so lower values were removed as outliers; only 39 of the 2013 remaining Seker beans fit this description.

In the Roundness histogram most Seker beans have a value of 0.88 or greater, so lower values were removed as outliers; only 48 of the 1974 remaining Seker beans fit this description.
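The same removal pattern is applied throughout this section, so it can be sketched once; the data and the 0.98 Solidity cutoff below mirror the description above, but the frame itself is a hypothetical stand-in:

```python
import pandas as pd

# Hypothetical Seker subset; the 0.98 Solidity cutoff comes from the histogram.
seker = pd.DataFrame({"Solidity": [0.990, 0.985, 0.975, 0.992, 0.981]})

before = len(seker)
seker = seker[seker["Solidity"] >= 0.98]  # keep only values at or above the cutoff
print(f"removed {before - len(seker)} outliers, {len(seker)} beans remain")
```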


BARBUNYA

By analyzing the histograms above, we can see that in the ShapeFactor4 histogram of the Barbunya bean most beans have a value of 0.988 or greater, so lower values were removed as outliers; only 24 of the 1322 Barbunya beans fit this description.

In the Eccentricity histogram most Barbunya beans have a value of 0.6 or greater, so lower values were removed as outliers; only 14 of the 1298 remaining Barbunya beans fit this description.


BOMBAY

By analyzing the histograms above, we can see that in the Solidity histogram of the Bombay bean most beans have a value of 0.973 or greater, so lower values were removed as outliers; only 14 of the 522 Bombay beans fit this description.

In the Roundness histogram most Bombay beans have a value between 0.8 and 0.925, so values outside this range were removed as outliers; only 8 of the 508 remaining Bombay beans fit this description.


CALI

By analyzing the histograms above, we can see that in the Perimeter histogram of the Cali bean most beans have a value between 900 and 1225, so values outside this range were removed as outliers; only 35 of the 1630 Cali beans fit this description.

In the Eccentricity histogram most Cali beans have a value of 0.76 or greater, so lower values were removed as outliers; only 25 of the 1595 remaining Cali beans fit this description.

In the Compactness histogram most Cali beans have a value of 0.71 or greater, so lower values were removed as outliers; only 13 of the 1570 remaining Cali beans fit this description.

In the Roundness histogram most Cali beans have a value between 0.78 and 0.895, so values outside this range were removed as outliers; only 21 of the 1557 remaining Cali beans fit this description.

In the ShapeFactor2 histogram most Cali beans have a value between 0.00088 and 0.0014, so values outside this range were removed as outliers; only 19 of the 1536 remaining Cali beans fit this description.

In the ShapeFactor4 histogram most Cali beans have a value of 0.978 or greater, so lower values were removed as outliers; only 18 of the 1517 remaining Cali beans fit this description.


HOROZ

By analyzing the histograms above, we can see that in the Eccentricity histogram of the Horoz bean most beans have a value of 0.805 or greater, so lower values were removed as outliers; only 29 of the 1928 Horoz beans fit this description.

In the Solidity histogram most Horoz beans have a value of 0.97 or greater, so lower values were removed as outliers; only 65 of the 1899 remaining Horoz beans fit this description.

In the Roundness histogram most Horoz beans have a value between 0.72 and 0.86, so values outside this range were removed as outliers; only 36 of the 1834 remaining Horoz beans fit this description.

In the Compactness histogram most Horoz beans have a value of 0.655 or greater, so lower values were removed as outliers; only 16 of the 1798 remaining Horoz beans fit this description.

In the MinorAxisLength histogram most Horoz beans have a value of 210 or less, so greater values were removed as outliers; only 32 of the 1782 remaining Horoz beans fit this description.

In the EquivDiameter histogram most Horoz beans have a value of 296 or less, so greater values were removed as outliers; only 12 of the 1750 remaining Horoz beans fit this description.

In the ShapeFactor2 histogram most Horoz beans have a value of 0.00139 or less, so greater values were removed as outliers; only 36 of the 1738 remaining Horoz beans fit this description.

In the ShapeFactor4 histogram most Horoz beans have a value of 0.98 or greater, so lower values were removed as outliers; only 37 of the 1702 remaining Horoz beans fit this description.


SIRA

By analyzing the histograms above, we can see that in the Perimeter histogram of the Sira bean most beans have a value between 700 and 890, so values outside this range were removed as outliers; only 82 of the 2636 Sira beans fit this description.

In the MajorAxisLength histogram most Sira beans have a value between 259 and 342, so values outside this range were removed as outliers; only 44 of the 2554 remaining Sira beans fit this description.

In the Solidity histogram most Sira beans have a value of 0.98 or greater, so lower values were removed as outliers; only 41 of the 2510 remaining Sira beans fit this description.

In the Roundness histogram most Sira beans have a value between 0.82 and 0.93, so values outside this range were removed as outliers; only 32 of the 2469 remaining Sira beans fit this description.

In the Compactness histogram most Sira beans have a value of 0.85 or less, so greater values were removed as outliers; only 18 of the 2437 remaining Sira beans fit this description.

In the ShapeFactor4 histogram most Sira beans have a value of 0.988 or greater, so lower values were removed as outliers; only 17 of the 2419 remaining Sira beans fit this description.


DERMASON

By analyzing the histograms above, we can see that in the Solidity histogram of the Dermason bean most beans have a value of 0.98 or greater, so lower values were removed as outliers; only 61 of the 3546 Dermason beans fit this description.

In the AspectRation histogram most Dermason beans have a value between 1.27 and 1.75, so values outside this range were removed as outliers; only 56 of the 3485 remaining Dermason beans fit this description.

In the ShapeFactor4 histogram most Dermason beans have a value of 0.9915 or greater, so lower values were removed as outliers; only 43 of the 3429 remaining Dermason beans fit this description.

In the Roundness histogram most Dermason beans have a value of 0.85 or greater, so lower values were removed as outliers; only 39 of the 3386 remaining Dermason beans fit this description.

Finding duplicates

As we can see, there are 62 duplicated Horoz bean samples, so we need to remove them.

All duplicated values were removed.
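With pandas, finding and dropping exact duplicate rows is a two-call sketch; the toy frame below (with one duplicate row) stands in for the real data:

```python
import pandas as pd

# Toy frame with one exact duplicate row; the values are illustrative.
df = pd.DataFrame({
    "Area": [30279, 30279, 30600],
    "Perimeter": [684.9, 684.9, 690.1],
    "Class": ["HOROZ", "HOROZ", "HOROZ"],
})

n_dupes = int(df.duplicated().sum())  # rows identical to an earlier row
df = df.drop_duplicates()
print(n_dupes, "duplicates removed,", len(df), "rows remain")
```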


Sampling the data

As we can see, the classes are not balanced. Since the smallest class (Bombay) has 450 samples, we decided to draw samples of size 500 for each bean type.
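One way to sketch per-class sampling with pandas is shown below; note the assumption that classes smaller than the target size are sampled with replacement (the notebook does not state how the 450-sample Bombay class reaches 500), and the toy frame and sample size are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy imbalanced frame standing in for the bean dataset.
df = pd.DataFrame({
    "Area": rng.normal(50_000, 5_000, 130).round(),
    "Class": ["DERMASON"] * 80 + ["BOMBAY"] * 50,
})

N = 60  # per-class sample size (500 in the notebook)
# Sample N rows per class; classes smaller than N are sampled with replacement.
balanced = pd.concat(
    g.sample(n=N, replace=len(g) < N, random_state=0)
    for _, g in df.groupby("Class")
)
print(balanced["Class"].value_counts())
```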


Defining inputs and labels

Training test split data
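The split step can be sketched with scikit-learn; the synthetic data below is a stand-in (7 classes as in the bean data, with an illustrative feature count), since the notebook's actual cells are not shown:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 7 classes as in the bean data; feature count illustrative.
X, y = make_classification(n_samples=700, n_features=16, n_informative=8,
                           n_classes=7, random_state=0)

# Hold out 20% for testing, stratified so each class keeps its proportion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print(X_train.shape, X_test.shape)
```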

Classification Techniques, Evaluation & Prediction

Decision Tree
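A minimal sketch of the decision-tree step, combining the grid search and 5-fold cross-validation mentioned in the conclusion; the hyperparameter grid and synthetic data are illustrative assumptions, not the notebook's actual values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bean features (7 classes; feature count illustrative).
X, y = make_classification(n_samples=700, n_features=16, n_informative=8,
                           n_classes=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 5-fold grid search over a few common tree hyperparameters.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [5, 10, None], "criterion": ["gini", "entropy"]},
    cv=5,
)
grid.fit(X_train, y_train)
acc = grid.score(X_test, y_test)  # accuracy of the best model on the held-out set
print(grid.best_params_, round(acc, 3))
```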

Nearest Neighbor
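A k-nearest-neighbours sketch under the same synthetic-data assumption; the scaling step and k=5 are illustrative choices, since distance-based classifiers are sensitive to feature magnitude:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bean features (7 classes; feature count illustrative).
X, y = make_classification(n_samples=700, n_features=16, n_informative=8,
                           n_classes=7, random_state=0)

# Scale inside the pipeline so each CV fold is scaled on its own training data.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(knn, X, y, cv=5)  # 5-fold cross-validated accuracy
print(round(scores.mean(), 3))
```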

Naïve Bayes
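A Gaussian Naive Bayes sketch (a reasonable variant for continuous shape measurements, though the notebook does not name which one it used), also printing the per-class precision, recall and f-measure discussed in the conclusion:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the bean features (7 classes; feature count illustrative).
X, y = make_classification(n_samples=700, n_features=16, n_informative=8,
                           n_classes=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Fit and predict; GaussianNB models each feature as class-conditionally normal.
nb = GaussianNB().fit(X_train, y_train)
y_pred = nb.predict(X_test)

# Per-class precision, recall and f1-score.
print(classification_report(y_test, y_pred))
```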

Support Vector Machines
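An SVM sketch under the same assumptions; the C values and kernels in the grid are illustrative, and scaling is included because SVMs are sensitive to feature ranges:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the bean features (7 classes; feature count illustrative).
X, y = make_classification(n_samples=700, n_features=16, n_informative=8,
                           n_classes=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Grid-search C and the kernel with 5-fold CV; scaling happens inside each fold.
svm = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(svm, {"svc__C": [0.1, 1, 10],
                          "svc__kernel": ["rbf", "linear"]}, cv=5)
grid.fit(X_train, y_train)
acc = grid.score(X_test, y_test)
print(grid.best_params_, round(acc, 3))
```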

Neural Networks
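A neural-network sketch using scikit-learn's multilayer perceptron; the single 32-unit hidden layer and iteration budget are illustrative assumptions, not the notebook's actual architecture:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bean features (7 classes; feature count illustrative).
X, y = make_classification(n_samples=700, n_features=16, n_informative=8,
                           n_classes=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# A small MLP; scaling the inputs helps the optimizer converge.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
)
mlp.fit(X_train, y_train)
acc = mlp.score(X_test, y_test)
print(round(acc, 3))
```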

Conclusion

With this project we concluded that implementing a Machine Learning technique, in our case Supervised Learning, requires going through many steps to obtain an acceptable model.

Firstly, we went through an analysis phase, which is always necessary for building a good model, since "a bad dataset leads to bad models". To avoid that, we removed outliers, duplicates, and null values, which gave us a more reliable and accurate dataset. To make the tests uniform, we sampled our data: since the class sizes varied widely across bean types (the data was not balanced), we risked a well-known problem where the reported performance is unrealistic. For instance, a model could reach almost 100% while, in reality, simply predicting the most common class for every sample. To avoid the overfitting phenomenon we used k-fold cross-validation, which splits the dataset into k different folds; this allows all the data to be used for both training and testing, so the model behaves well on both.

To obtain the best classifiers, we first researched each model's parameters. We then used grid search, which finds the best combination of those parameters and returns the best model possible.

To evaluate the performance of each model we used several metrics: accuracy, precision, recall, and f-measure. Accuracy tells us the percentage of predictions the model got right. Precision is the fraction of the observations the model predicted as positive that were actually positive. Recall is the fraction of all positive examples in the dataset that were correctly predicted. F-measure is the harmonic mean of precision and recall, taking both false negatives and false positives into account. After looking at the results across all the classification techniques mentioned, we verified that Decision Tree and Naive Bayes were the classifiers that obtained the best results, with scores above 90%.

Group 36: